Data Collection Methods

Data by Affiliation

DataCite metadata were pulled using the rdatacite package in November 2022. Each of the six institutions was searched by university name in the creators.affiliation.name metadata field. Results were filtered to DOIs with a publicationYear of 2012 or later and a resourceTypeGeneral of Dataset or Software. Because the search terms also matched other institutions with similar names, results were further filtered to DOIs from the relevant institutional affiliations only.
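This retrieval step can be sketched as follows (a minimal sketch: the query string follows the DataCite REST API syntax, the rdatacite call is commented out to avoid a live API request, and the helper names are illustrative, not the actual script):

```r
# Build a DataCite query for one institution (illustrative helper;
# the field name follows the DataCite REST API).
build_affiliation_query <- function(inst) {
  sprintf('creators.affiliation.name:"%s"', inst)
}

# Live retrieval (not run here):
# res <- rdatacite::dc_dois(query = build_affiliation_query("Virginia Tech"))

# Post-retrieval filter mirroring the text: 2012 or later, datasets and software.
filter_dois <- function(df) {
  subset(df, publicationYear >= 2012 &
             resourceTypeGeneral %in% c("Dataset", "Software"))
}
```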

Following the recommendations of the Crossref API documentation, metadata were pulled from the April 2022 Public Data File (http://dx.doi.org/10.13003/83b2gq). The file was searched for DOI records with a created date-parts year of 2012 or later, a type of dataset (Crossref does not offer software as a resource type), and an author affiliation matching one of the six institutions.
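The filtering described above can be sketched like this (field names follow the Crossref metadata JSON; the record structure shown is simplified, and the affiliation handling is an assumption about how the records were parsed):

```r
# Return TRUE if a parsed Crossref record matches the study criteria:
# created year >= 2012, type "dataset", and an author affiliation
# matching the institution pattern.
matches_criteria <- function(rec, inst_pattern) {
  created_year <- rec$created$`date-parts`[[1]][1]
  affiliations <- unlist(lapply(rec$author, function(a) a$affiliation))
  isTRUE(created_year >= 2012) &&
    identical(rec$type, "dataset") &&
    any(grepl(inst_pattern, affiliations, ignore.case = TRUE))
}
```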

Institutional repositories

Upon initial examination of the affiliation data, we realized that our own institutional repositories were not represented in the data because the affiliation metadata field was not completed as part of the DOI generation process.

To pull data shared in our institutional repositories as a comparison, a second search retrieved DOIs published by the institutional repository at each university. For the repositories using DataCite to issue DOIs (5 of the 6 institutions at the time), the DataCite API was queried with the names of the institutional repositories in the publisher metadata field. For the one institution using Crossref to issue DOIs (Duke), the Crossref API was used to retrieve all DOIs published under the Duke member prefixes.
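A sketch of the two retrieval strategies (the repository name and DOI prefix below are hypothetical placeholders, not the actual values used):

```r
# DataCite side: query by repository name in the publisher field
# (field name follows the DataCite REST API; "Example Data Repository"
# is a placeholder).
publisher_query <- function(repo_name) {
  sprintf('publisher:"%s"', repo_name)
}
# rdatacite::dc_dois(query = publisher_query("Example Data Repository"))

# Crossref side: keep DOIs under a member prefix ("10.1234" is a placeholder).
has_prefix <- function(doi, prefix) {
  startsWith(tolower(doi), paste0(prefix, "/"))
}
```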

Institutional repository data were then filtered to include only the relevant repositories, the dataset and software resource types, and DOIs published in 2012 or later.

Affiliation data from DataCite, affiliation data from Crossref, and the institutional repository data were combined into a single dataset.

Analysis

Load required packages and read in combined data.

#packages
pacman::p_load(dplyr, 
               tidyr, 
               ggplot2, 
               rjson,
               rdatacite,
               cowplot, 
               stringr, 
               knitr, 
               DT)



#Load the combined data from 3_Combined_data.R
load(file="data_rdata_files/Combined_ALL_data.Rdata")

#rename object
all_dois <- combined_dois 

#re-order factor levels so that DataCite appears before Crossref
all_dois$group <- factor(all_dois$group, levels = c("Affiliation - Datacite", "Affiliation - CrossRef", "IR_publisher"))

Collapse DOIs by container

Some repositories (such as Harvard’s Dataverse and the Qualitative Data Repository) assign DOIs at the level of the file rather than the study. Similarly, Zenodo often has many related DOIs for multiple figures within a study. To compare study-to-study counts of data sharing, we look at the DOIs collapsed by “container”.

by_container <- 
all_dois %>% 
  filter(!is.na(container_identifier)) %>% 
  group_by(container_identifier, publisher, title, institution) %>% 
  summarize(count=n()) %>% 
  arrange(desc(count))

How many publishers have container DOIs?

by_container %>% 
  group_by(publisher) %>% 
  summarize(count=n()) %>% 
  arrange(desc(count)) %>% 
  datatable

Collapsing by container for counts

containerdups <- which(!is.na(all_dois$container_identifier) & duplicated(all_dois$container_identifier))

all_dois_collapsed <- all_dois[-containerdups,]

This leaves a total of 165,950 cases.

Overview of the data

DOI types by resource

all_dois_collapsed %>% 
  group_by(resourceTypeGeneral, group) %>% 
  summarize(count=n()) %>% 
  pivot_wider(names_from = group, 
              values_from = count, 
              values_fill = 0) %>% 
  kable()
resourceTypeGeneral    Affiliation - Datacite    Affiliation - CrossRef    IR_publisher
Dataset                                 11572                    147702            2103
Software                                 4512                         0              61

DOI by institutional affiliation/publisher

all_dois_collapsed %>% 
  group_by(group, institution) %>% 
  summarize(count=n()) %>% 
  pivot_wider(names_from = group,
              values_from = count) %>% 
  kable()
institution      Affiliation - Datacite    Affiliation - CrossRef    IR_publisher
Cornell                            3921                       706             174
Duke                               2372                      3603             225
Michigan                           4188                    141111             645
Minnesota                          2408                      1700             692
Virginia Tech                      1553                        64             333
Washington U                       1642                       518              95

Collapse IRs into a single category

Look at all the Institutional Repositories Captured

IR_pubs <- all_dois_collapsed %>% 
  filter(group == "IR_publisher") %>% 
  group_by(publisher_plus) %>% 
  summarize(count = n()) 

IR_pubs %>% 
  kable(col.names = c("Institutional Repository", "Count"))
Institutional Repository                          Count
Cornell                                             174
Duke-Duke Digital Repository                         78
Duke-Research Data Repository, Duke University      147
Michigan                                             10
Michigan-Deep Blue                                  515
Michigan-ICPSR/ISR                                  109
Michigan-Other                                       11
Minnesota                                           692
Virginia Tech                                       333
Washington U                                         95

Replace all of these publishers with “Institutional Repository” so that they will be represented in a single bar.

all_dois_collapsed$publisher[which(all_dois_collapsed$publisher_plus %in% unique(IR_pubs$publisher_plus))] <- "Institutional Repository"

#catch the rest of the "Cornell University Library"
all_dois_collapsed$publisher[which(all_dois_collapsed$publisher == "Cornell University Library")] <- "Institutional Repository"

#and stray VT
all_dois_collapsed$publisher[which(all_dois_collapsed$publisher == "University Libraries, Virginia Tech")] <- "Institutional Repository"

#and DRUM
all_dois_collapsed$publisher[which(all_dois_collapsed$publisher == "Data Repository for the University of Minnesota (DRUM)")] <- "Institutional Repository"

##ICPSR is also inconsistent
all_dois_collapsed$publisher[grep("Consortium for Political", all_dois_collapsed$publisher)] <- "ICPSR"

Overall Count of Data and Software DOIs

We keep data and software DOIs together for the main analysis.

by_publisher_collapse <- all_dois_collapsed %>% 
  group_by(publisher, institution) %>% 
  summarize(count=n()) %>% 
  arrange(institution, desc(count))

Table of publisher counts

by_publisher_collapse_table <- by_publisher_collapse %>% 
  pivot_wider(names_from = institution, 
              values_from = count, 
              values_fill = 0) %>% 
  rowwise %>% 
  mutate(Total = sum(c_across(Cornell:`Washington U`))) %>% 
  arrange(desc(Total))

by_publisher_collapse_table %>% 
  datatable

Write out the table of data & software publishers

write.csv(by_publisher_collapse_table, file="data_summary_data/Counts of Publishers By Institution - Collapsed by container.csv", row.names = F)

Graphs

Top 8 publishers of data dois

by_publisher_dc_collapse <- all_dois_collapsed %>% 
  group_by(publisher, institution) %>% 
  summarize(count=n()) %>% 
  arrange(institution, desc(count))

#table of  publishers - data
by_publisher_dc_collapse_table <- by_publisher_dc_collapse %>% 
  pivot_wider(names_from = institution, 
              values_from = count, 
              values_fill = 0) %>% 
  rowwise %>% 
  mutate(Total = sum(c_across(Cornell:`Washington U`))) %>% 
  arrange(desc(Total))

Look at publishers based on rank of number of DOIs

by_publisher_dc_collapse_table %>% 
  group_by(publisher) %>% 
  summarize(count=sum(Total)) %>% 
  arrange(desc(count)) %>% 
  mutate(pubrank = row_number()) %>% 
  ggplot(aes(x=pubrank, y=count)) +
  geom_bar(stat="identity") +
  scale_x_continuous(limits = c(0,25)) +
  labs(x = "Publisher Rank", y="Number of DOIs", title="Number of DOIs by top Publishers")+
  theme_bw() 

Look at the top 8 publishers - how many does this capture?

top8pubs <- by_publisher_dc_collapse_table$publisher[1:8]

by_publisher_dc_collapse_table %>% 
  group_by(publisher) %>% 
  summarize(count=sum(Total)) %>% 
  mutate(intop8pub = publisher %in% top8pubs) %>% 
  group_by(intop8pub) %>% 
  summarize(totalDOIs = sum(count), nrepos = n()) %>% 
  ungroup() %>% 
  mutate(propDOIs = totalDOIs/sum(totalDOIs))
## # A tibble: 2 × 4
##   intop8pub totalDOIs nrepos propDOIs
##   <lgl>         <int>  <int>    <dbl>
## 1 FALSE          2102    168   0.0127
## 2 TRUE         163848      8   0.987
top8colors <- c("Harvard Dataverse" = "dodgerblue2",
                "Zenodo" = "darkorange1",
                "ICPSR" = "darkcyan",
                "Dryad" = "lightgray", 
                "figshare" = "purple", 
                "Institutional Repository" = "lightblue", 
                "ENCODE Data Coordination Center" = "gold2", 
                "Faculty Opinions Ltd" = "darkgreen")



(by_publisher_plot_collapse <-  by_publisher_dc_collapse %>% 
    filter(publisher %in% top8pubs) %>% 
    ggplot(aes(x=institution, y=count, fill=publisher)) +
    geom_bar(stat="identity", position=position_dodge(preserve = "single")) +
    scale_fill_manual(values = top8colors, name="Publisher")+
    guides(fill = guide_legend(title.position = "top")) +
    scale_y_continuous(breaks = seq(from = 0, to=5000, by=500)) +
    coord_cartesian(ylim = c(0,5000)) +
    labs(x = "Institution", y="Count of Collapsed DOIs", caption = "Note: Michigan ENCODE bar cut off for scaling") +
    theme_bw() +
    theme(legend.position = "bottom", legend.title.align = .5))

ggsave(by_publisher_plot_collapse, filename = "figures/Counts of DOIs by Institution_DOIcollapsed.png", device = "png",  width = 8, height = 6, units="in")

Institutional Graphs - Collapsed

Cornell

Duke

Michigan

Minnesota

Virginia Tech

Wash U

Repository Proliferation by Year

With how many different publishers are researchers sharing their data and software, and how does this change over time?

by_year_nrepos <- all_dois_collapsed %>% 
  group_by(publicationYear, publisher, institution) %>% 
  summarize(nDOIs = n()) %>% 
  group_by(publicationYear, institution) %>% 
  summarize(npublishers = n(), totalDOIs = sum(nDOIs))

by_year_nrepos %>% 
  ggplot(aes(x=publicationYear, y=npublishers, group=institution)) +
  geom_line(aes(color=institution)) +
  labs(x="Year", 
       y="Number of Repositories", 
       title="Number of Repositories Where Data and Software are Shared Across Time") +
  theme_bw() +
  theme(legend.title = element_blank())

Further collapse by Version

We can also look at the data collapsed by version of a record. This is motivated by the fact that some repositories create separate entries for each version of the same dataset or collection, and some records have many versions.

Explore versions

Some repositories append “.vX” to the DOI.

all_dois_collapsed <- all_dois_collapsed %>% 
  mutate(hasversion = grepl("\\.v[[:digit:]]+$", DOI))


all_dois_collapsed %>% 
  filter(hasversion == TRUE) %>% 
  group_by(publisher, hasversion) %>% 
  summarize(count=n()) %>% 
  arrange(desc(count)) %>% 
  datatable()

Some repositories use the versionCount field.

all_dois_collapsed %>% 
  filter(versionCount > 0) %>% 
  group_by(publisher) %>% 
  summarize(count=n(), AvgNversions = round(mean(versionCount),2)) %>% 
  arrange(desc(count)) %>% 
  datatable()

Some use the metadataVersion field.

all_dois_collapsed %>% 
  filter(metadataVersion > 0) %>% 
  group_by(publisher) %>% 
  summarize(count=n(), AvgNversions = round(mean(metadataVersion),2)) %>% 
  arrange(desc(count)) %>% 
  datatable()

How to collapse by version? Maybe that’s for another day…
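One possible approach, sketched here but not applied in this analysis: strip the trailing “.vN” suffix to get a base DOI and keep one row per base. (This only handles the suffix convention above, not versionCount or metadataVersion.)

```r
# Reduce a versioned DOI to its base form by removing a trailing ".vN".
base_doi <- function(doi) {
  sub("\\.v[[:digit:]]+$", "", doi)
}

# Keep one record per base DOI (not run here):
# all_dois_deversioned <-
#   all_dois_collapsed[!duplicated(base_doi(all_dois_collapsed$DOI)), ]
```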

DataCite Affiliation data

Look at repositories with affiliation and publication years prior to 2014

DataCite released affiliation as a metadata option on Oct 16, 2014. Repositories showing affiliations on items published before that date may have back-filled the metadata after the fact.

What repositories have publications with affiliation before then?

all_dois_collapsed %>% 
  group_by(publisher, publicationYear) %>% 
  summarize(count=n()) %>% 
  arrange(publicationYear) %>% 
  pivot_wider(names_from = publicationYear, 
              values_from = count) %>% 
  arrange(`2012`, `2013`, `2014`, `2015`) %>% 
  datatable()

Completeness

Look at fields from DataCite DOIs (the metadata from that source are the most complete).

First, unlist some of the nested list-column fields.

all_dois_collapsed$has_subjects <- unlist(lapply(all_dois_collapsed$subjects, function(x) length(x[[1]])))

all_dois_collapsed$has_dates <- unlist(lapply(all_dois_collapsed$dates, function(x) ifelse(length(x) > 0, nrow(x[[1]]), 0)))

all_dois_collapsed$has_relatedIdentifiers <- unlist(lapply(all_dois_collapsed$relatedIdentifiers, function(x) ifelse(length(x[[1]]) > 0, nrow(x[[1]]), 0)))

all_dois_collapsed$has_sizes <- unlist(lapply(all_dois_collapsed$sizes, function(x) length(x[[1]])))

all_dois_collapsed$has_rightsList <- unlist(lapply(all_dois_collapsed$rightsList, function(x) ifelse(length(x[[1]]) > 0, nrow(x[[1]]), 0)))

all_dois_collapsed$has_descriptions <- unlist(lapply(all_dois_collapsed$descriptions, function(x) ifelse("description" %in% names(x[[1]]), 1, 0)))

all_dois_collapsed$has_geolocations <- unlist(lapply(all_dois_collapsed$geoLocations, function(x) length(x[[1]])))

all_dois_collapsed$has_fundingReferences <- unlist(lapply(all_dois_collapsed$fundingReferences, function(x) ifelse(length(x[[1]]) > 0, nrow(x[[1]]), 0)))

all_dois_collapsed$has_formats <- unlist(lapply(all_dois_collapsed$formats, function(x) length(x[[1]])))

Then create dataset with indicators for whether fields have information in them (only indicates presence of information, not quality of information).

all_dois_collapsed_completeness <- 
all_dois_collapsed %>% 
  mutate(has_id = ifelse(!is.na(id), 1, 0), 
         has_publicationYear  = ifelse(!is.na(publicationYear), 1, 0), 
         has_URL = ifelse(!is.na(URL), 1, 0)) %>% 
  select(id, group, institution, publisher, starts_with("has_")) %>% 
  pivot_longer(cols=has_subjects:has_URL, 
               names_to = "variable", 
               values_to = "value") %>% 
  mutate(value_indc = ifelse(value == 0, 0, 1)) 

Completeness of Top Non-IR Repositories

by_publisher_complete_dc <- all_dois_collapsed_completeness %>% 
  filter(publisher %in% top8pubs) %>% 
  filter(group == "Affiliation - Datacite", 
         publisher != "Institutional Repository") %>% 
  group_by(publisher, variable) %>% 
  summarize(complete = sum(value_indc), total = n()) %>% 
  mutate(percent_complete = complete/total*100)

Dryad

Figshare

Harvard Dataverse

ICPSR

Zenodo

Completeness of IR Repositories

by_publisher_complete_ir <- all_dois_collapsed_completeness %>% 
 filter(publisher == "Institutional Repository") %>% 
  group_by(institution, variable) %>% 
  summarize(complete = sum(value_indc), total = n()) %>% 
  mutate(percent_complete = complete/total*100)

Cornell

Duke

NOTE: This will not be accurate because Duke metadata came from CrossRef

Michigan

Minnesota

Virginia Tech

Wash U

Look at Funder References across repositories

Look at the proportion of DOIs that have funder references filled out. Because not all data are the result of funding, we also look at similar fields as a baseline completeness metric relevant to all DOIs: subjects (a non-required field that applies equally to funded and unfunded works) and publication year (a required field, serving as a proxy for the total number of records).

by_publisher_complete_ir %>% 
  rename(publisher = institution) %>% 
  bind_rows(by_publisher_complete_dc) %>% 
  filter(variable %in% c("has_publicationYear", "has_fundingReferences", "has_subjects")) %>% 
  ggplot(aes(x=publisher, y=complete)) +
  geom_bar(stat = "identity", aes(fill=variable), position="dodge") + 
  scale_fill_hue(name="Metadata Field")+
  coord_flip() +
  labs(x="Repository/Publisher", y="Number of DOIs with completed field") +
  theme_bw()

Write out institutional data

Write out CSV files for each institution:

  • All DOIs
  • All DOIs collapsed

for (i in unique(all_dois$institution)) {
  all_dois %>% 
    filter(institution == i) %>% 
    write.csv(file=paste0("data_all_dois/All_dois_", i, gsub("-", "", Sys.Date()), ".csv"), row.names = F)
  
  all_dois_collapsed %>% 
    filter(institution == i) %>% 
    write.csv(file=paste0("data_all_dois/All_dois_collapsed_", i, gsub("-", "", Sys.Date()), ".csv"), row.names = F)
}